On the size of minimal automata for approximate string matching
Abstract
A natural way to solve the problem of string matching with k mismatches is to construct a finite automaton recognizing the language $L_{P,k}$ of all strings at distance k from the searched pattern P, and to use it for a linear-time scan of the text. The drawback of this approach is its high space complexity. In this paper, we show that, even if we consider the minimal DFA recognizing $L_{P,k}$, the memory required remains large, and the number of states C of the minimal automaton increases quickly with the size m of P. For a pattern consisting of the repetition of one character, the exact number of states is $C = \binom{m+1}{k+1}$. For a pattern whose characters are all different, an accurate lower bound on C is $\sum_{i=1}^{k} \alpha_i b_i$, where, for all i, $b_i$ is the $(i+1)$th Catalan number and $\alpha_i$ is a positive integer depending on m and k. For a random pattern P, a lower bound is obtained by considering the longest prefix of P consisting either of the repetition of a single character or of characters that are all different.
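The automaton-based approach in the abstract can be illustrated with a minimal Python sketch. It is not the paper's construction. It assumes that "k mismatches" means Hamming distance at most k, and that the automaton scans the text for occurrences ending at each position. The names `step`, `find_k_mismatch`, and `count_reachable_subsets` are hypothetical. The last function counts the subsets reached by the standard subset construction, which only bounds the minimal DFA size for this language from above and is not the quantity C studied in the paper.

```python
def step(active, pattern, k, ch):
    """Advance the set of active NFA states (i, e) by one text character.

    A state (i, e) means: i pattern characters consumed, e mismatches so far.
    """
    nxt = {(0, 0)}  # the start state stays active: a match may begin anywhere
    for i, e in active:
        if i == len(pattern):
            continue  # accepting states have no outgoing transitions
        if pattern[i] == ch:
            nxt.add((i + 1, e))        # character agrees with the pattern
        elif e < k:
            nxt.add((i + 1, e + 1))    # spend one of the k allowed mismatches
    return frozenset(nxt)


def find_k_mismatch(text, pattern, k):
    """Return the end indices of text windows within Hamming distance k of pattern."""
    m = len(pattern)
    active = frozenset({(0, 0)})
    ends = []
    for j, ch in enumerate(text):
        active = step(active, pattern, k, ch)
        if any(i == m for i, _ in active):
            ends.append(j)
    return ends


def count_reachable_subsets(pattern, k, alphabet):
    """Count the state sets reached by the subset construction (assumed upper bound only)."""
    start = frozenset({(0, 0)})
    seen = {start}
    stack = [start]
    while stack:
        s = stack.pop()
        for ch in alphabet:
            t = step(s, pattern, k, ch)
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return len(seen)


print(find_k_mismatch("abcabd", "abd", 1))  # window end positions with <= 1 mismatch
```

Each text character triggers one call to `step`, so the scan is linear in the text length for fixed m and k. Determinizing the automaton in advance trades that per-character work for table storage, which is the space cost the abstract refers to.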
Similar resources
Space Complexity of Linear Time Approximate String Matching
Approximate string matching is a sequential problem, and it is therefore possible to solve it using finite automata. Nondeterministic finite automata are constructed for string matching with k mismatches and k differences. The corresponding deterministic finite automata are the basis for approximate string matching in linear time. Then the space complexity of both types of deterministic automata is calculat...
Reduced Nondeterministic Finite Automata for Approximate String Matching
We will show how to reduce the number of states of nondeterministic finite automata for approximate string matching with k mismatches, and of nondeterministic finite automata for approximate string matching with k differences, in the case when we do not need to know how many mismatches or differences are in the found string. We will also show the impact of this reduction on Shift-Or based algorithms.
Efficient generation of super condensed neighborhoods
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Condensed neighborhoods, however, are not a minimal representation of a pattern neighborhood. Super condensed neighborhoods, proposed in this work, are smaller, provably minimal and can be used to locate approximate matches that can later be extended by on-line search. We...
Faster Generation of Super Condensed Neighbourhoods Using Finite Automata
We present a new algorithm for generating super condensed neighbourhoods. Super condensed neighbourhoods have recently been presented as the minimal set of words that represent a pattern neighbourhood. These sets play an important role in the generation phase of hybrid algorithms for indexed approximate string matching. An existing algorithm for this purpose is based on a dynamic programming ap...
Approximate Regular Expression Matching
We extend the definition of the Hamming and Levenshtein distances between two strings, as used in approximate string matching, so that these two distances can also be used in approximate regular expression matching. Next, methods for constructing nondeterministic finite automata for approximate regular expression matching under both distances are presented.